traffic lights. In this paper, the main emphasis is on evaluating the two most promising deep learning
architectures: single shot detector (SSD) and Faster R-CNN, in terms of mean average precision (mAP@0.50)
and average recall. The impact of data augmentation on the two architectures is also analyzed. With
ResNet50 V1 as the feature extractor, Faster R-CNN achieved 96% mAP, which is better than the original
ResNet50 V1 Faster R-CNN pipeline. In addition, different parameters such as batch size, learning rate,
and optimizer are tuned for detecting and classifying small traffic lights into different categories.

Keywords: Faster R-CNN, SSD, Traffic light detection

1. INTRODUCTION

Recognizing traffic lights is an important and widely researched problem in the field of autonomous
driving. Conventional traffic light detection techniques struggle to detect small objects because of their
small representations, low resolution, deformable appearance, significant overlap, and small pixel counts.
They also suffer from false positives in complex backgrounds. Recently, deep learning algorithms have
addressed real-time tasks related to autonomous driving such as detecting traffic signals, traffic signs, and
pedestrians. Deep learning for object detection has attracted the attention of researchers since the
introduction of the region-based convolutional neural network (R-CNN) [1]. Deep learning techniques extract
features automatically and are more accurate than traditional machine learning techniques. Earlier deep
learning algorithms nevertheless often have difficulty detecting small objects because of the mismatch
between the spatial detail and the semantic information in deep convolutional neural networks (DCNNs). In
complicated scenes with similar background objects and/or occlusion, such as remote sensing imagery, the
problem can be even harder. Traffic lights occupy a small area of an image. Moreover, small traffic lights
are difficult to detect because background content such as trees, roads, cars, and sky occupies a large
portion of the image. Li et al. [2] used prior knowledge about the context of traffic lights to eliminate
computational redundancy in traffic light recognition. The authors suggested a set of enhancements, such as
an aggregate channel feature method that alters each channel for the traffic light type, and then
constructed a fusion detection mechanism. They also described an inter-frame information analysis method
that uses detection information from the previous frame to adjust the initial proposal regions, thereby
improving accuracy. Experiments on the vision for intelligent vehicles and applications (VIVA) dataset
showed that the model performed well compared with previous traffic light identification methods.
Kim et al. [3] presented a two-stage approach: a segmentation technique that uses connected-component
labelling with an 8-connected neighbourhood to find the bounding-box coordinates of candidate regions,
followed by a convolutional network.

The simulations showed that the presented technique outperformed the traditional Faster R-CNN.
Kim et al. [4] found that the colour space of the input video and the deep learning network model play a
vital role in designing an algorithm for detecting traffic lights. The authors evaluated six colour spaces
with the Faster R-CNN and R-FCN deep learning architectures. They conducted the experiments on a traffic
light dataset with images of 1280x720 resolution, and the results suggested that the RGB colour space
combined with Faster R-CNN yields good results. These findings can be used to develop a comprehensive
design guide for traffic light detection systems. Jensen et al. [5] presented different challenges in
traffic light recognition (TLR) research. They examined different TLR systems and proposed an evaluation
process for such systems. They also created a public dataset of U.S. roadway footage, comprising video
sequences shot with a stereo camera in various lighting and weather conditions, for comparing TLR systems.
Munoz-Organero et al. [6] developed a novel mechanism for automatically detecting traffic lights, street
crossings, and roundabouts for producing street maps. Janahiraman and Subuhan [7] evaluated the TensorFlow
object recognition framework [8] on different challenges. They developed a technique for traffic light
detection using MobileNetV2 single shot multibox detector (SSD) and Faster R-CNN. The results of the
experiments showed that Faster R-CNN outperformed SSD by 38.806%. Wang et al. [9] investigated the
identification of traffic lights based on deep learning.

They proposed a region proposal technique based on parameters such as color, intensity, and geometry
for detecting traffic lights. They tested the model on 6804 photographs of diverse circumstances; the model
achieved an average accuracy of 99.6%, a recall of 99.2%, and a detection accuracy of 98.5%.
Wang et al. [10] designed a simultaneous detection and tracking mechanism to ascertain the color as well as
the position of traffic lights. Wang and Zhou [11] proposed a technique in which traffic light detection is
performed on dark frames. In dark frames, this dual-channel method can fully utilize undistorted colour and
shape information, whereas it uses context from the bright frames. Hassan et al. [12] examined two
approaches for detecting small objects. The first approach used a traditional color-based segmentation
method to recognize objects based on hue, saturation, and value parameters, whereas the second employed
Mask R-CNN to identify traffic lights. Hui et al. [13] presented a novel method that generates candidate
traffic light regions based on genetic optimization. Localization and classification of traffic lights are
based on deep neural networks that perform tasks ranging from feature region extraction and parameter
sampling of candidate regions to parameter optimization of the genetic algorithm. Gokul et al. [14]
examined the model architecture and parameters of Faster R-CNN and you only look once (YOLO) detectors to
recognise and categorise images in the Bosch small traffic light dataset.

In terms of precision, the experiment indicates that the Faster R-CNN model outperforms the YOLO
model, whereas YOLO performs better when it comes to real-time deployment. Lu et al. [15] developed an
attention model that produces a small number of probable regions, which helps small targets to be detected
and categorised more easily. They curated a Tencent street dataset comprising 15K instances with a wider
diversity of traffic lights in street settings, lighting situations, and forms than those found in the
laboratory for intelligent and safe automobiles (LISA) dataset for traffic signal recognition. The
discussed algorithm was found to outperform the conventional Faster R-CNN object detection framework.
Muller and Dietmayer [16] conducted experiments with the Inception V3 model and employed the non-maximum
suppression algorithm on the DriveU traffic light dataset (DTLD) to avoid multiple detections of small
objects. Cao et al. [17] proposed a novel enhanced loss function based on intersection over union (IoU) and
used bilinear interpolation to improve the position information, which improved both the recall and the
accuracy for small objects. Patel and Thakkar [18] discussed the role of AI in different application areas.
Jasm et al. [19] implemented image classification using a convolutional neural network (CNN) on the
Canadian Institute for Advanced Research, 10 classes (CIFAR-10) dataset. Kadim et al. [20] used a
pre-trained CNN along with fully connected layers to handle changes in appearance. Different
hyperparameters, the learning rate, and the ratio of training samples were also optimized for tracking in
night conditions. Liu and Yan [21] discussed the use of capsule networks for traffic light detection. The
literature shows that detecting small objects remains a challenging problem because of the small number of
pixels, varying backgrounds, and low resolution, and it needs further research. Existing deep learning
architectures perform better on larger objects, but their performance drops on smaller objects. In this
paper, experiments are performed on the SSD pipeline with feature pyramid networks (FPN) and on Faster
R-CNN using ResNet50 V1 as the feature extractor, and it is found that Faster R-CNN with ResNet50 V1 as the
pretrained model performs better than existing techniques after data augmentation is added.


2. RELATED WORK 
2.1. Two stage object detection algorithm: Faster R-CNN 

The R-CNN family of algorithms was designed first. Faster R-CNN is an improvement over Fast R-CNN.
Fast R-CNN [1] used the selective search algorithm to generate regions in which an object could be present.
For a large dataset, Fast R-CNN therefore generates a huge number of regions. Faster R-CNN [22] is a neural
network-based framework that combines a region proposal network (RPN) and the Fast R-CNN algorithm for
region generation and object detection, respectively. It uses the RPN to create object proposals and
reduces the proposal generation time from the 2 s per image taken by selective search in the R-CNN
algorithms to 10 ms per image. It also allows layers to be shared between the region proposal and detection
stages, thus enhancing the feature representation. The architecture [19] of the model is shown in Figure 1.

Figure 1. Faster R-CNN deep learning neural network architecture [20]


2.1.1 Convolutional neural network 

A CNN is used as the feature extractor (backbone) to which an image is passed. As discussed in [19],
there are two possible CNNs, VGG-16 (13 shareable convolutional layers) and ZFnet (5 shareable
convolutional layers), both of which have a network stride of 16. This means that an input image of
(1000x600) pixels yields a feature map of about (1000/16 x 600/16) ~= (62 x 37). In Figure 1, VGG-16 is
shown as the feature extractor; it comprises a series of 3x3 convolutional layers with a stride of 1 and
padding of 1, and 2x2 max-pooling layers with a stride of 2. Layer 13, the last convolutional layer, is
used as the input to the RPN.
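
As a quick check of the stride arithmetic above, the following minimal sketch computes the feature-map size for a given input size and network stride. The function name `feature_map_size` is hypothetical, and flooring the division is an assumption; real frameworks may round differently depending on padding.

```python
def feature_map_size(width, height, stride=16):
    """Return the (width, height) of the backbone feature map.

    Assumes the total network stride divides the input size with flooring,
    which reproduces the approximate (62 x 37) figure quoted in the text.
    """
    return width // stride, height // stride


w, h = feature_map_size(1000, 600)
print(w, h)  # 62 37
```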


2.1.2 Region proposal network 

The input to the RPN is the convolutional feature map generated by the backbone network, and its
output is a set of anchors (potential bounding box candidates where an object may be present) obtained by
sliding-window convolution. For every point on the feature map, the network predicts whether an object is
present by placing $k = 9$ (default) anchors in three distinct scales with box areas of $128^2$, $256^2$,
and $512^2$ pixels and three aspect ratios of 1:1, 1:2, and 2:1. Therefore, for a convolutional feature map
of size (w x h), a total of n = w x h x 9 anchors are generated. These anchors are further refined to
generate bounding boxes. The feature map is passed into a 3x3 convolution layer with padding of 1, which
generates a 512-dimensional feature vector at every location for VGG-16. The RPN generates the region
proposals from the anchors. The output of this layer is passed to two 1x1 convolution layers. One is the
classification layer, with output of size (h, w, 18), which gives the probability that each anchor box
contains an object or not. Its total number of outputs is 2n, that is w x h x 2k. The other is the
regression layer, with output of size (h, w, 36), which generates 4 regression coefficients for each of the
9 anchors at every point of the backbone feature map. Its total number of outputs is 4n, that is
w x h x 4k. Because it shares computation on the convolutional features, the RPN reduces cost, improves
accuracy, and reduces running time. Avoiding the generation of superfluous proposal boxes also improves the
accuracy of Faster R-CNN, but it slows down processing.

Indonesian J Elec Eng & Comp Sci, Vol. 26, No. 3, June 2022: 1486-1494
Indonesian J Elec Eng & Comp Sci, ISSN: 2502-4752

The output feature map comprises 40x60 locations and hence 40x60x9 anchors. An anchor is considered
to contain an object if its IoU (intersection over union) with any ground-truth box is more than 0.7. An
anchor does not contain an object if its IoU with all ground-truth boxes is less than 0.3. The total number
of anchors is 40x60x9 = 21,600, which is very large. To reduce this count, anchors that cross the image
boundary are ignored and the non-maximum suppression algorithm is applied, which reduces the number of
anchors used for training to 2000. The loss in RPN is calculated in (1):


$$L(p_i, t_i) = \frac{1}{N_c} \sum_i L_c(p_i, p_i^*) + \lambda \frac{1}{N_r} \sum_i p_i^* L_r(t_i, t_i^*) \tag{1}$$


where $N_c$: number of anchors in the mini-batch (256); $N_r$: number of anchors (~2000); $L_c$:
cross-entropy (softmax) loss for the classifier; $L_r$: smooth L1 loss for the bounding box regressor,
which is active only when the anchor contains an object; $p_i$: predicted probability that the anchor
contains an object; $p_i^*$: ground-truth label (0 for negative anchors, 1 for positive anchors); $t_i$:
predicted box coordinates; $t_i^*$: ground-truth box coordinates.
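
The anchor layout and labelling rule described in this subsection can be sketched as follows. This is an illustrative sketch only: the helper names `anchor_shapes` and `label_anchor` are hypothetical, the aspect ratio is taken as width divided by height (conventions differ between implementations), and the rule is simplified to the two IoU thresholds quoted in the text.

```python
import math


def anchor_shapes(scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return (width, height) for each of the k = 3 x 3 = 9 anchors.

    Every anchor keeps the area scale**2; ratio = width / height
    is an assumption made for this sketch.
    """
    shapes = []
    for s in scales:
        for r in ratios:
            shapes.append((s * math.sqrt(r), s / math.sqrt(r)))
    return shapes


def label_anchor(max_iou, pos_thresh=0.7, neg_thresh=0.3):
    """Label an anchor from its highest IoU with any ground-truth box.

    Returns 1 (object), 0 (background), or -1 (ignored during training).
    """
    if max_iou > pos_thresh:
        return 1
    if max_iou < neg_thresh:
        return 0
    return -1


shapes = anchor_shapes()
total = 40 * 60 * len(shapes)  # anchors on a 40x60 feature map
print(len(shapes), total)  # 9 21600
```

Anchors whose best IoU falls between the two thresholds are marked as ignored here; this is a common convention and an assumption of the sketch, since the text does not say how they are treated.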


2.1.3 Fast-RCNN (detector) 

The detector consists of a pretrained CNN for feature extraction, a region of interest (ROI) pooling
layer, fully connected layers, a softmax layer for classification, and a regression layer. ROI pooling is
combined with Fast R-CNN to form the detection pipeline. The ROI pooling layer converts the differently
sized region proposals generated by the RPN into fixed-size feature maps of size (7 x 7 x D)
(D=512 for VGG-16, D=256 for ZFnet). Each feature map is then passed to two fully connected layers.
One of them is the classification (softmax) layer, which has n+1 outputs (n: number of classes) and
generates the classification scores, that is, the probability of each class for a proposal. The regression
layer has 4xn outputs and uses these coefficients to refine the predicted bounding boxes. An ROI is
considered an object proposal if its IoU is greater than or equal to 0.5. ROIs that are positive for a
class are labelled as belonging to that class (u=1,...,n), and ROIs that belong to the background class
have u=0. The multi-task loss for each ROI is calculated in (2) as the combination of two losses: the
classification loss and the regression loss.


$$\text{Fast R-CNN loss} = L_c(p, u) + \lambda L_r(t^u, v) \tag{2}$$


The classification loss $L_c(p, u)$ is calculated for every ROI over (n+1) classes:
$L_c(p, u) = -\log(p_u)$ is the log loss for the true class u, and $L_r(t^u, v)$ is the smooth L1 loss for
regression. The regression layer generates regression offsets $t_i^u$, where $i \in \{x, y, w, h\}$;
(x, y) refers to the top-left corner, and w and h refer to the bounding box width and height, respectively.
$v_i$ refers to the true bounding box regression targets.
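
A minimal sketch of the multi-task loss in (2) is given below. It uses the standard smooth L1 definition (quadratic below 1, linear above) and, as an assumption taken from the original Fast R-CNN formulation rather than from the text above, applies the regression term only to foreground ROIs (u >= 1). The function names and the default value of `lam` are hypothetical.

```python
import math


def smooth_l1(x):
    """Smooth L1: 0.5 * x**2 when |x| < 1, otherwise |x| - 0.5."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def fast_rcnn_loss(probs, u, t_u, v, lam=1.0):
    """Multi-task loss for a single ROI.

    probs: class probabilities over n + 1 classes (index 0 = background)
    u:     true class index
    t_u:   predicted offsets (x, y, w, h) for class u
    v:     target offsets (x, y, w, h)
    lam:   weight of the regression term (assumed 1.0 here)
    """
    cls_loss = -math.log(probs[u])
    # Assumption: background ROIs (u == 0) contribute no regression loss.
    reg_loss = sum(smooth_l1(t - g) for t, g in zip(t_u, v)) if u >= 1 else 0.0
    return cls_loss + lam * reg_loss


loss = fast_rcnn_loss([0.1, 0.7, 0.2], 1, (0.1, 0.2, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0))
```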


2.2. One stage object detection algorithm: single shot detector 

A one-stage detector treats object detection as a simple regression problem: it takes an input image
and learns the class probabilities and bounding box coordinates. The architecture of SSD is explained
below.


2.2.1 Backbone network 

The main goal of the backbone network is feature extraction. There are two versions, SSD300 and
SSD500. In this paper [23], SSD300 is considered. The VGG-16 architecture, pretrained on the ImageNet
dataset for image classification, is used for feature extraction and generation of the feature maps. The
feature maps used to assemble SSD300 with the VGG16 base network are conv4_3, fc7, conv8_2, conv9_2,
conv10_2, and conv11_2.
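
To make the multi-scale structure concrete, the sketch below counts the default boxes produced by the six feature maps listed above. The feature-map side lengths and the number of default boxes per location are assumptions taken from the standard SSD300 configuration, not from this text, which only names the layers.

```python
# (layer name, feature-map side length, default boxes per location)
# Sizes and box counts are assumed from the standard SSD300 setup.
SSD300_LAYERS = [
    ("conv4_3", 38, 4),
    ("fc7", 19, 6),
    ("conv8_2", 10, 6),
    ("conv9_2", 5, 6),
    ("conv10_2", 3, 4),
    ("conv11_2", 1, 4),
]


def total_default_boxes(layers):
    """Sum side * side * boxes_per_location over all prediction layers."""
    return sum(side * side * boxes for _, side, boxes in layers)


print(total_default_boxes(SSD300_LAYERS))  # 8732
```

Under these assumptions the largest map (conv4_3) contributes 5776 of the 8732 boxes, which is why the early, high-resolution layer matters most for small objects such as traffic lights.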


2.2.2 Convolutional layers for prediction 

Convolutional feature layers are added to the truncated base network. Convolutional predictors
preserve the spatial information in the feature maps and also require fewer parameters, as the predictions
produced are of size 38 x 38 x number_of_classes. SSD uses the auxiliary layers to extract features at
multiple scales, and these layers gradually reduce the size of the input. The architecture [20] in Figure 2
generates bounding boxes and scores for the objects present in those boxes using a feed-forward
convolutional network, and then applies the non-maximum suppression (NMS) algorithm to obtain the final
result. The first layers of the network are based on a standard architecture for high-quality image
classification. The added layers gradually decrease in size, allowing detections to be predicted at various
scales. To predict detections, a separate convolutional model is used for each feature layer. Using
convolutional filters at each layer generates 8732 predictions per class. The detections are performed at
different scales. The fc6 and fc7 layers of the VGG16 network are converted into convolutional layers, and
the pool size of the pool5 layer is changed from (2,2) to (3,3) with a stride of 1. L2 normalization is
added to the conv4_3 layer of VGG16. The SSD training objective in (3) is the weighted sum of the
localization loss ($L_l$) in (4) and the confidence loss ($L_c$) in (5).

Figure 2. Single shot detector architecture [19]


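$$L(x, c, l, g) = \frac{1}{N}\left(L_c(x, c) + \alpha L_l(x, l, g)\right) \tag{3}$$

$$L_l(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k} \, \mathrm{smooth}_{L1}\left(l_i^{m} - \hat{g}_j^{m}\right) \tag{4}$$

where

$$\hat{g}_j^{cx} = (g_j^{cx} - d_i^{cx}) / d_i^{w}, \qquad \hat{g}_j^{cy} = (g_j^{cy} - d_i^{cy}) / d_i^{h}$$

$$\hat{g}_j^{w} = \log(g_j^{w} / d_i^{w}), \qquad \hat{g}_j^{h} = \log(g_j^{h} / d_i^{h})$$

$$L_c(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}), \quad \text{where } \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})} \tag{5}$$


where N: number of matched default boxes; c: class confidences; $\alpha$: weight term. The localization
loss $L_l$ is the smooth L1 loss between the predicted box (l) and the ground-truth box (g). The offsets
are regressed relative to the center (cx, cy) of the default box (d) and to its width (w) and height (h).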
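
The offset definitions that accompany (4) can be illustrated with a short sketch that encodes a ground-truth box relative to a default box and decodes it again. The helper names `encode_box` and `decode_box` are hypothetical, boxes are assumed to be given as (cx, cy, w, h), and the variance scaling used by some implementations is deliberately left out.

```python
import math


def encode_box(g, d):
    """Encode ground-truth box g relative to default box d.

    Both boxes are (cx, cy, w, h) tuples; the result is the regression
    target (g_hat_cx, g_hat_cy, g_hat_w, g_hat_h).
    """
    g_cx, g_cy, g_w, g_h = g
    d_cx, d_cy, d_w, d_h = d
    return (
        (g_cx - d_cx) / d_w,
        (g_cy - d_cy) / d_h,
        math.log(g_w / d_w),
        math.log(g_h / d_h),
    )


def decode_box(t, d):
    """Invert encode_box: recover (cx, cy, w, h) from offsets t and default box d."""
    t_cx, t_cy, t_w, t_h = t
    d_cx, d_cy, d_w, d_h = d
    return (
        t_cx * d_w + d_cx,
        t_cy * d_h + d_cy,
        d_w * math.exp(t_w),
        d_h * math.exp(t_h),
    )


default_box = (0.50, 0.50, 0.20, 0.20)
truth_box = (0.55, 0.48, 0.10, 0.30)
offsets = encode_box(truth_box, default_box)
recovered = decode_box(offsets, default_box)
```

Because the center offsets are divided by the default box size and the size offsets are logarithmic, the targets stay in a similar numeric range for small and large boxes.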


3. RESEARCH METHOD 
In this paper, the ResNet50 V1 model is used as the feature extractor in the Faster R-CNN framework, as
shown in Figure 3. Parameters such as batch size, learning rate, epochs, and step size are fine-tuned for
the traffic lights present in the dataset. The data augmentation techniques applied are brightness,
contrast, hue, and saturation adjustments, which increase the diversity of the dataset and improve
performance on the small traffic lights in the dataset.

Figure 3. Data augmented Faster R-CNN architecture


4. RESULTS AND DISCUSSION
4.1. Experimental setup 

The experiments were performed on Google Colab using a Tesla K80 GPU. The software used comprised
the TensorFlow object detection API framework and an Anaconda virtual environment with cuDNN 7.6, CUDA 10,
and Python 3.8. The evaluation is performed on a subset of the la route automatisée (LaRA) traffic light
benchmark dataset [23]. Each architecture was trained for 6000 steps with a batch size of 4. The object
detection results are evaluated using two metrics: mean average precision and recall. The training and
testing images are resized to a resolution of 640x640 pixels, as required by the frameworks. The data
augmentation transformations applied to the training dataset are random brightness with a value of 0.2,
random contrast in the range (0.7,0.11), random hue with a value of 0.01, and random saturation in the
range (0.75,1.15). The original dataset [24], [25] contains 11179 8-bit red, green, blue (RGB) traffic
light images with a resolution of 640x480 and 9168 annotated traffic lights divided into four classes,
namely go, stop, warning, and ambiguous. Figure 4 presents images that illustrate different detection
challenges: blur in Figure 4(a), multiple objects in Figure 4(b), and small objects in Figure 4(c).
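
The photometric augmentations listed in the experimental setup can be sketched in plain Python as follows. This is not the TensorFlow object detection API implementation; it is a simplified illustration that samples one brightness offset, contrast factor, hue shift, and saturation factor per image. The function name `augment` is hypothetical, and the upper contrast bound of 1.1 is an assumption, because the printed range (0.7,0.11) cannot be used as written.

```python
import colorsys
import random


def clamp01(value):
    """Clip a float to the range [0, 1]."""
    return min(1.0, max(0.0, value))


def augment(pixels, rng, brightness=0.2, contrast=(0.7, 1.1),
            hue=0.01, saturation=(0.75, 1.15)):
    """Apply one random photometric transform to a whole image.

    pixels: list of (r, g, b) floats in [0, 1]
    rng:    random.Random instance, so that results are reproducible
    The contrast upper bound (1.1) is an assumed value.
    """
    b = rng.uniform(-brightness, brightness)
    c = rng.uniform(*contrast)
    dh = rng.uniform(-hue, hue)
    s_scale = rng.uniform(*saturation)
    n = len(pixels)
    mean = [sum(p[ch] for p in pixels) / n for ch in range(3)]
    out = []
    for p in pixels:
        # Contrast scales the distance from the channel mean; brightness adds an offset.
        rgb = [clamp01((p[ch] - mean[ch]) * c + mean[ch] + b) for ch in range(3)]
        h, s, v = colorsys.rgb_to_hsv(*rgb)
        h = (h + dh) % 1.0            # hue shift wraps around the colour wheel
        s = clamp01(s * s_scale)      # saturation is scaled and clipped
        out.append(colorsys.hsv_to_rgb(h, s, v))
    return out


image = [(0.9, 0.1, 0.1), (0.1, 0.8, 0.2), (0.5, 0.5, 0.5)]
augmented = augment(image, random.Random(0))
```

Only colour statistics change under these transforms, so the bounding box annotations remain valid without any adjustment.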


Figure 4. Examples of frames from LaRA traffic light dataset; (a) blurred image, (b) multiple objects,
and (c) small-sized object


4.2. Evaluation metrics 

In evaluating the object detection model for traffic light detection, precision in (6) is defined as
the ratio of true positive traffic light samples to the total number of positive predictions, and recall in
(7) is the ratio of true positive traffic light samples to the total number of actual (relevant) samples.
These two measures have a trade-off relationship, which means that one improves at the expense of the other.


$$\text{Precision} = \frac{TL_P}{TL_P + FL_P} \tag{6}$$

$$\text{Recall} = \frac{TL_P}{TL_P + FL_N} \tag{7}$$
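
As a worked example of (6) and (7), the sketch below computes precision and recall from detection counts. The function name `precision_recall` and the counts are made up for illustration, and returning 0.0 for an empty denominator is a convention chosen for this sketch.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true positive, false positive,
    and false negative counts.

    A zero denominator yields 0.0, which is an assumed convention.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical counts: 8 correct detections, 2 false alarms, 8 missed lights.
p, r = precision_recall(tp=8, fp=2, fn=8)
print(p, r)  # 0.8 0.5
```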


IoU is the amount of overlap between the predicted bounding box and the ground-truth box for object
detection. A prediction is marked as a true positive if its IoU is above a certain threshold. The IoU
threshold is varied between 0.5 and 0.95. Average precision (AP) is a metric used to measure the accuracy
of an object detection model. It indicates how well the model handles the positives. For common objects in
context (COCO), AP is computed by averaging over several IoU thresholds: AP@[.5:.95] is defined as the
average AP for IoU thresholds from 0.5 to 0.95 with a step size of 0.05. Mean average precision (mAP) is
calculated as the mean of the AP scores over all classes and across all IoU thresholds.
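
The IoU test that underlies these metrics can be sketched as follows. Boxes are assumed to be given as (x_min, y_min, x_max, y_max), the helper names are hypothetical, and a prediction is counted as a true positive when its IoU is greater than or equal to the threshold, which is an assumption about the boundary case.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds 0.50, 0.55, ..., 0.95 used for AP@[.5:.95].
COCO_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def true_positive_flags(pred, truth):
    """For each threshold, report whether pred matches truth."""
    overlap = iou(pred, truth)
    return [overlap >= t for t in COCO_THRESHOLDS]


flags = true_positive_flags((0, 0, 10, 10), (0, 0, 10, 5))
```

In this example the prediction covers the ground-truth box but is twice as tall, so the IoU is 0.5 and the detection counts as correct only at the loosest threshold.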


4.3. Discussion of the results 

In this paper, a ResNet50 V1 model pretrained on the ImageNet database is used as the feature
extractor and fine-tuned on the LaRA traffic light dataset. The RPN is trained on a mini-batch, and the
RPN and base network parameters are then updated. Both the positive and the negative proposals generated by
the RPN are then used for training and further updating the classifier. The loss function and the
parameterization of coordinates for bounding box regression are kept as in the original Faster R-CNN.
Stochastic gradient descent (SGD) is used as the optimizer, with the initial learning rates of the RPN and
the classifier set to 0.04 and a learning rate decay of 0.0005 per batch. Figures 5-7 show the results for
the SSD ResNet50 V1 FPN, Faster R-CNN ResNet50 V1, and data augmented Faster R-CNN ResNet50 V1
architectures, respectively. Each network is trained for 6000 steps. For the SSD ResNet50 V1 FPN
architecture, the classification loss is shown in Figure 5(a), the localization loss in Figure 5(b), the
regularization loss in Figure 5(c), and the total loss in Figure 5(d). For the Faster R-CNN ResNet50 V1
architecture, the BoxClassifier classification loss is shown in Figure 6(a), the BoxClassifier localization
loss in Figure 6(b), the RPN objectness loss in Figure 6(c), the RPN localization loss in Figure 6(d), the
total loss in Figure 6(e), and the regularization loss. For the data augmented modified Faster R-CNN
ResNet50 V1 architecture, the BoxClassifier classification loss is shown in Figure 7(a), the BoxClassifier
localization loss in Figure 7(b), the RPN objectness loss in Figure 7(c), the RPN localization loss in
Figure 7(d), the total loss in Figure 7(e), and the regularization loss. In each plot, the x-axis
represents the number of training steps and the y-axis the corresponding loss. The loss decreases
continuously with each step until it reaches a value of 0.2. Table 2 reports the results of the different
architectures in terms of mean average precision, and it is found experimentally that the Faster R-CNN
based model achieves an 11% higher mean average precision than the SSD based model. There is a further
increase of 2% in mean average precision after data augmentation is performed.

Figure 5. Losses in SSD ResNet50 V1 FPN detection framework for (a) classification loss,
(b) localization loss, (c) regularization loss, and (d) total loss

Figure 6. Losses in Faster R-CNN ResNet50 V1 detection framework for (a) BoxClassifier: classification
loss, (b) BoxClassifier: localization loss, (c) RPN: objectness loss, (d) RPN: localization loss, and
(e) total loss

Figure 7. Losses in modified Faster R-CNN ResNet50 V1 detection framework for (a) BoxClassifier:
classification loss, (b) BoxClassifier: localization loss, (c) RPN: objectness loss, (d) RPN: localization
loss, and (e) total loss


Table 2. Results for different models on LaRA traffic light dataset

Model                                       Resolution (in pixels)   mAP (in %)   AR (in %)
SSD ResNet50 V1 FPN                         640x640                  83           62
Faster R-CNN ResNet50 V1                    640x640                  94           43
Data augmented + Faster R-CNN ResNet50 V1   640x640                  94           40


5. CONCLUSION 

In this paper, the problem of detecting small traffic lights in images from a challenging dataset is
addressed. A comparison is also performed between two existing architectures, SSD and Faster R-CNN, with
ResNet50 V1 as the pretrained feature extractor. Faster R-CNN ResNet50 V1 achieved a higher mAP of 94%
compared with 83% for SSD ResNet50 V1 FPN. Further, data augmentation applied to the dataset during
training yielded an improvement of 2% in the mAP score of Faster R-CNN ResNet50 V1. The results, together
with the systematic study, led to performance gains in detecting small traffic lights. In future, other
architectures can be explored for performance on this dataset.